| Variable | Description | Type |
|---|---|---|
| diabetes_dx | Diabetes diagnosis (1 = Yes, 0 = No) based on medical questionnaire. | Categorical |
| age | Age of participant in years. | Continuous |
| bmi | Body Mass Index (BMI) in kilograms per square meter (kg/m²), calculated from measured height and weight. | Continuous |
| sex | Sex of participant (Male or Female). | Categorical |
| race | Race/Ethnicity (e.g., Non-Hispanic White, Non-Hispanic Black, Mexican American, etc.). | Categorical |
| WTMEC2YR | Examination sample weight for MEC (Mobile Examination Center) participants. | Weight |
| SDMVPSU | Primary Sampling Unit (PSU) used for variance estimation in complex survey design. | Design |
| SDMVSTRA | Stratum variable used to define strata for complex survey design. | Design |
| age_c | Age variable centered and standardized (z-score). | Continuous |
| bmi_c | BMI variable centered and standardized (z-score). | Continuous |
| wt_norm | Normalized survey weight (WTMEC2YR divided by its mean, for model weighting). | Weight |
Bayesian Logistic Regression - Application in Probability Prediction of disease (Diabetes)
CapStone Project_2025
Slides: slides.html ( Go to slides.qmd to edit)
Introduction
Diabetes mellitus (DM) is a major public health concern closely associated with factors such as obesity, age, race, and gender. Identifying these associated risk factors is essential for targeted interventions D’Angelo and Ran (2025). Logistic Regression (traditional) that estimates the association between risk factors and outcomes is insufficient in analyzing the complex healthcare data (DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments. Zeger et al. (2020). Classical maximum likelihood estimation (MLE) yields unstable results in samples that are small, have missing data, or presents quasi- and complete separation.
Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) allow analysis of multivariate longitudinal healthcare data with repeated measures within individuals and individuals nested in a population. By integrating prior knowledge and including exogenous (e.g., age, clinical history) and endogenous (e.g., current treatment) covariates, Bayesian models provide posterior distributions and risk predictions for conditions such as pneumonia, prostate cancer, and mental disorders. Parametric assumptions remain a limitation of these models.
In Bayesian inference Chatzimichail and Hatjimihail (2023), Bayesian inference has shown that parametric models (with fixed parameters) often underperform compared to nonparametric models, which do not assume a prior distribution. Posterior probabilities from Bayesian approaches improve disease classification and better capture heterogeneity in skewed, bimodal, or multimodal data distributions. Bayesian nonparametric models are flexible and robust, integrating multiple diagnostic tests and priors to enhance accuracy and precision, though reliance on prior information and restricted access to resources can limit applicability. Combining Bayesian methods with other statistical or computational approaches helps address systemic biases, incomplete data, and non-representative datasets.
The Bayesian framework described by R. van de Schoot et al. (2021) highlights the role of priors, data modeling, inference, posterior sampling, variation inference, and variable selection. Proper variable selection mitigates multicollinearity, overfitting, and limited sampling, improving predictive performance. Priors can be informative, weakly informative, or diffuse, and can be elicited from expert opinion, generic knowledge, or data-driven methods. Sensitivity analysis evaluates the alignment of priors with likelihoods, while MCMC simulations (e.g., brms, blavaan in R) empirically estimate posterior distributions. Spatial and temporal Bayesian models have applications in large-scale cancer genomics, identifying molecular interactions, mutational signatures, patient stratification, and cancer evolution, though temporal autocorrelation and subjective prior elicitation can be limiting.
Bayesian normal linear regression has been applied in metrology for instrument calibration using conjugate Normal–Inverse-Gamma priors Klauenberg et al. (2015). Hierarchical priors add flexibility by modeling uncertainty across multiple levels, improving robustness and interpretability. Bayesian hierarchical/meta-analytic linear regression incorporates both exchangeable and unexchangeable prior information, addressing multiple testing challenges, small sample sizes, and complex relationships among regression parameters across studies Leeuw and Klugkist (2012)
A sequential clinical reasoning model Liu et al. (2013) Sequential clinical reasoning models demonstrate screening by adding predictors stepwise: (1) demographics, (2) metabolic components, and (3) conventional risk factors, incorporating priors and mimicking clinical evaluation. This approach captures ecological heterogeneity and improves baseline risk estimation, though interactions between predictors and external cross-validation remain limitations.
Bayesian multiple imputation with logistic regression addresses missing data in clinical research Austin et al. (2021) in clinical research by classifying missing values (e.g., patient refusal, loss to follow-up, mechanical errors) as MAR, MNAR, or MCAR. Multiple imputation generates plausible values across datasets and pools results for reliable classification of patient health status and mortality.
Aims
This study aims to apply Bayesian logistic regression to predict diabetes status and examine its association with body mass index (BMI), age (≥20 years), gender, and race using data from the 2013–2014 NHANES survey. The NHANES dataset employs a complex sampling design involving stratification, clustering, and oversampling of specific population subgroups. By adopting a Bayesian analytical framework, this study seeks to overcome challenges commonly encountered in traditional logistic regression—such as missing data, separation, and biases introduced by complete case analysis—thereby improving the efficiency, robustness, and interpretability of model estimates in predicting population-level health outcomes.
Method
Bayesian Logistic Regression
To estimate the associations between predictors and outcome probabilities. Bayesian framework integrates prior information with observed data to generate posterior distributions, allowing direct probabilistic interpretation of parameters.
This approach provides flexibility in model specification, accounts for parameter uncertainty, and produces credible intervals that fully reflect uncertainty in the estimates.
Unlike traditional frequentist methods, the Bayesian method enables inference through posterior probabilities, supporting more nuanced decision-making and interpretation.
Model Structure
Bayesian logistic regression is a probabilistic modeling framework used to estimate the relationship between one or more predictors (continuous or categorical) and a binary outcome (e.g., presence/absence of disease).
It extends classical logistic regression by combining it with Bayesian inference, treating model parameters as random variables with probability distributions rather than fixed point estimates Leeuw and Klugkist (2012) and Klauenberg et al. (2015)
Bayesian logistic regression models the log-odds of a binary outcome as a linear combination of predictors:
In the Bayesian framework, model parameters ( ) are treated as random variables and assigned prior distributions that reflect existing knowledge or plausible ranges before observing the data. After incorporating the observed evidence, the priors are updated through Bayes’ theorem (Leeuw and Klugkist 2012; Klauenberg et al. 2015):
In the Bayesian framework, the coefficients () are assigned prior distributions, which are updated in light of the observed data to yield posterior distributions.
The Bayesian approach naturally incorporates uncertainty in all model parameters.
It combines prior beliefs with observed data to produce posterior distributions according to Bayes’ theorem: [ ]
- Likelihood: is the probability of the observed data given the model parameters (as in classical logistic regression).
- Prior: Encodes prior knowledge or beliefs about parameter values before observing the data.
- Posterior: is the updated beliefs about parameters after observing the data.
- Posterior ⋉ Likelihood * Prior
- Likelihood: is the probability of the observed data given the model parameters (as in classical logistic regression).
Prior Specification - Regression coefficients prior: normal(0, 2.5) — providing weakly informative regularization provide gentle regularization, constraining extreme values without overpowering the data R. van de Schoot et al. (2021) - Intercept prior: Student’s t-distribution prior, student_t(3, 0, 10) (van de Schoot et al., 2013).
This weakly informative prior:
- Has 3 degrees of freedom** (( = 3 )), producing heavy tails that allow for occasional large effects.
- Is centered at 0 (( = 0 )), reflecting no initial bias toward positive or negative associations.
- Has a scale parameter of 10 (( = 10 )), allowing broad variation in possible coefficient values.
Such priors improve stability in models with small sample sizes, high collinearity, or potential outliers.
Advantages of Bayesian Logistic Regression
- Uncertainty quantification: Produces full posterior distributions instead of single-point estimates.
- Credible intervals: Provide the range within which a parameter lies with a specified probability (e.g., 95%).
- Flexible priors: Allow incorporation of expert knowledge or results from previous studies.
- Probabilistic predictions: Posterior predictive distributions yield direct probabilities for future observations.
- Comprehensive model checking: Posterior predictive checks (PPCs) evaluate how well simulated outcomes reproduce observed data.
Posterior Predictions
- Posterior distributions of the coefficients are used to estimate the probability of the outcome for given predictor values. This enables statements such as:
“Given the predictors, the probability of the outcome lies between X% and Y%.” - Posterior predictions incorporate two sources of uncertainty:
- Parameter uncertainty: Variability in estimated model coefficients.
- Predictive uncertainty: Variability in future outcomes given those parameters.
- In Bayesian analysis, every unknown parameter — such as a regression coefficient, mean, or variance — is treated as a random variable with a probability distribution that expresses uncertainty given the observed data.
Model Evaluation and Diagnostics
- Model quality and convergence are assessed using standard Bayesian diagnostics:
- Posterior inference performed using Markov Chain Monte Carlo (MCMC) sampling via the No-U-Turn Sampler (NUTS), a variant of Hamiltonian Monte Carlo (HMC). Four chains run with sufficient warm-up iterations to ensure convergence. Austin et al. (2021)
- Convergence diagnostics: Markov Chain Monte Carlo (MCMC) performance was evaluated using ( ) (R-hat) and effective sample size (ESS).
- Autocorrelation checks: Ensure independence between successive MCMC draws.
- Posterior predictive checks (PPC): Compare simulated data from posterior distributions to observed outcomes.
- Bayesian R²: Quantified the proportion of variance explained by the predictors, incorporating uncertainty.
Analysis and Results
Bayesian logistic regression analysis is performed to estimate the risk of doctor-diagnosed diabetes among adults in the National Health and Nutrition Examination Survey (NHANES, 2013-2014), a 2-year data cross-sectional weighted data from the CDC’s National Center for Health Statistics (National Center for Health Statistics (NCHS) 2014 Center for Health Statistics (1999).
Data pre-processing
Data is imported using R packages and libraries for data wrangling and analysis. All computation is implemented in a probabilistic programming environment using the brms interface to Stan.
Three NHANES component files—Demographics (DEMO_H), Body Measures (BMX_H), and the Diabetes Questionnaire (DIQ_H) imported in .xpt format are merged using the unique participant identifier (SEQN) to create a single analytic dataset. Prior to merge, all variables of interest were harmonized and recoded to ensure consistent numeric and factor data types, resulting in a clean, atomic structure suitable for survey-weighted analyses and Bayesian modeling.
The merged dataset with 10,175 participants was filtered to include only adults (age ≥ 20) for subsequent analyses including demographic and anthropometric measures to provide a representative sample for assessing diabetes risk across the U.S. adult population reflecting the complex, multistage sampling design of NHANES.
TABLE
| Step | Description- recoding, categorization, missing data and final data |
|---|---|
| Weighting | Used the survey package to calculate weighted means and standard deviations for all variables. |
| Standardization | Standardized BMI and age variables for analysis. |
| Age Categorization | Recoded into intervals: 20–<30, 30–<40, 40–<50, 50–<60, 60–<70, and 70–80 years. |
| BMI Categorization | Recoded and categorized as: <18.5 (Underweight), 18.5–<25 (Normal), 25–<30 (Overweight), 30–<35 (Obesity I), 35–<40 (Obesity II), ≥40 (Obesity III). |
| Ethnicity Recoding | Recoded as: 1 = Mexican American, 2 = Other Hispanic, 3 = Non-Hispanic White, 4 = Non-Hispanic Black, 5 = Other/Multi. |
| Special Codes | Special codes (e.g., 3, 7) were transformed to NA. These codes are not random and could introduce bias if ignored (MAR or MNAR). |
| Missing Data | Missing values were retained and visualized to assess their pattern and informativeness. |
| Final Dataset | Created a cleaned analytic dataset (adult) using Non-Hispanic White and Male as reference groups for analysis. |
Exploratory Data Analysis (Adult, 20 - 80 years)
Code
library(dplyr)
library(skimr)
library(knitr)
library(tidyr)
library(purrr)
library(forcats)
library(kableExtra)
str(adult)'data.frame': 5769 obs. of 12 variables:
$ SDMVPSU : num 1 1 1 2 1 1 2 1 2 2 ...
$ SDMVSTRA : num 112 108 109 116 111 114 106 112 112 113 ...
$ WTMEC2YR : num 13481 24472 57193 65542 25345 ...
$ diabetes_dx: num 1 1 1 0 0 0 0 0 0 0 ...
$ bmi : num 26.7 28.6 28.9 19.7 41.7 35.7 NA 26.5 22 20.3 ...
$ age : num 69 54 72 73 56 61 42 56 65 26 ...
$ sex : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 2 1 2 1 2 ...
$ race : Factor w/ 5 levels "NH White","Mexican American",..: 4 1 1 1 2 1 3 1 1 1 ...
$ DIQ050 : num 1 1 1 2 2 2 2 2 2 2 ...
$ age_c : num 1.132 0.278 1.303 1.36 0.392 ...
$ bmi_c : num -0.3359 -0.0703 -0.0283 -1.3144 1.761 ...
$ bmi_cat : Factor w/ 6 levels "<18.5","18.5–<25",..: 3 3 3 2 6 5 NA 3 2 2 ...
Below are the quick weighted checks
- mean age of adults in NHANES
- prevalence of diabetes_dx (0/1)
- standard error
- actual NHANES variance, inflated by clustering + stratification
- Variance under simple random sampling (SRS)
- effective sample size for diabetes and the design effect = actual variance / SRS variance)
mean SE
age 47.496 0.3805
mean SE
diabetes_dx 0.089016 0.0048
variance SE
diabetes_dx 4759.9 0.0039
Effective sample size for diabetes_dx: 48142
Missingness
- 25% of columns are discrete (categorical), 75% are continuous (age and BMI).
- 92.7% of rows have complete information for all variables with fully observed data across predictors and outcomes for most participants.
- Only 1.3% of individual data points are missing across the dataset, reflecting minimal missingness.
- No column is entirely missing (0%), indicating all variables have at least some observed data.
- Overall missingness: ~4% → low, but non-trivial given the small number of variables involved.
Study Cohort (Adult)
Cohort includes standardized variables for age and BMI (age_c, bmi_c), categorical indicators for sex and race/ethnicity, and a binary doctor-diagnosed diabetes.
- Age
- 5,769 adults aged 20–80 years
- mean age = 49.1 years (SD = 17.6), and participants are fairly evenly distributed across adult age groups, with no sharp skewness.
- BMI
- The mean BMI was 29.1 kg/m² (SD = 7.15), with values ranging from 14.1 to 82.9 kg/m².
- A small proportion of missing values in variables (BMI and diabetes status).
- BMI data were available for 5,520 participants, with 249 missing values.
- Most participants have BMI values within the normal to overweight range, with fewer in the obese category.
- Sex: Sample includes a higher proportion of females than males.
Prevalence of diabetes in adult population
Diabetes by BMI categories: Individuals diagnosed with diabetes tend to have higher BMI values compared to non-diabetics.
Diabetesby Age Group: The proportion of diabetes increases with advancing age, highlighting age as a strong risk factor.
Diabetes by Race/Ethnicity: Differences are observed across racial/ethnic groups, with some showing higher prevalence rates than others.
Diabetes diagnosis by sex across different racial groups. Bars are side by side for each sex, with counts displayed on top
Statistical Modeling
Two modeling frameworks are compared using identical predictors (standardized age, BMI, sex, and race4) and the binary outcome diabetes_dx:
Survey-weighted logistic regression and complete-case analysi
Bayesian logistic regression with weakly informative priors
Survey-Weighted Logistic Regression (Complete-Case Analysis)
- We conducted a complete-case survey-weighted logistic regression to estimate the association between age, BMI, sex, and race with diabetes status.
- All analyses accounted for the NHANES complex sampling design, including sampling weights, primary sampling units (PSUs), and strata with the sufficiency of outcome data and categorical predictors (sex and race) retained at least two observed levels among participants with non-missing diabetes outcomes.
- For a complete-case indicator to retain only participants with non-missing values on the outcome and all model predictors (age, BMI, sex, and race).
- Complete-case dataset to fit a survey-weighted logistic regression model:
diabetes_dx∼age_c + bmi_c + sex + race
- The model was estimated using svyglm() with a quasibinomial family to accommodate the binary outcome under the NHANES design.
- This generated design-adjusted coefficient estimates, standard errors, and odds ratios reflecting population-representative associations for U.S. adults.
Code
# Modeling
library(broom)
library(mice)
library(brms)
library(posterior)
library(bayesplot)
library(knitr)
# --- Guardrails for modeling ---
n_outcome <- sum(!is.na(adult$diabetes_dx))
if (n_outcome == 0) stop("Too few non-missing outcomes for modeling. n = 0")
# Ensure factors and >=2 observed levels among complete outcomes
adult <- adult %>%
dplyr::mutate(
sex = if (!is.factor(sex)) factor(sex) else sex,
race = if (!is.factor(race)) factor(race) else race
)
if (nlevels(droplevels(adult$sex[!is.na(adult$diabetes_dx)])) < 2)
stop("sex has <2 observed levels after filtering; check data availability.")
if (nlevels(droplevels(adult$race[!is.na(adult$diabetes_dx)])) < 2)
stop("race has <2 observed levels after filtering; check Data Prep.")
# Survey-weighted complete-case
# Build a logical filter on the original adult data (same length as design$data)
keep_cc <- with(
adult,
!is.na(diabetes_dx) & !is.na(age_c) & !is.na(bmi_c) &
!is.na(sex) & !is.na(race)
)
# Subset the survey design using the logical vector (same length as original)
des_cc <- subset(nhanes_design_adult, keep_cc)
# Corresponding complete-case data (optional)
cc <- adult[keep_cc, ] |> droplevels()
cat("\nComplete-case N for survey-weighted model:", nrow(cc), "\n")
Complete-case N for survey-weighted model: 5349
Code
form_cc <- diabetes_dx ~ age_c + bmi_c + sex + race
svy_fit <- survey::svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
summary(svy_fit)
Call:
svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
Survey design:
subset(nhanes_design_adult, keep_cc)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.67143 0.11935 -22.383 1.68e-08 ***
age_c 1.10833 0.05042 21.981 1.94e-08 ***
bmi_c 0.63412 0.05713 11.099 3.88e-06 ***
sexFemale -0.63844 0.10926 -5.843 0.000386 ***
raceMexican American 0.71091 0.13681 5.196 0.000826 ***
raceOther Hispanic 0.46469 0.13474 3.449 0.008712 **
raceNH Black 0.51221 0.15754 3.251 0.011677 *
raceOther/Multi 0.84460 0.17756 4.757 0.001433 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.8455444)
Number of Fisher Scoring iterations: 6
Code
svy_or <- broom::tidy(svy_fit, conf.int = TRUE) %>%
dplyr::mutate(OR = exp(estimate), LCL = exp(conf.low), UCL = exp(conf.high)) %>%
dplyr::select(term, OR, LCL, UCL, p.value) %>%
dplyr::filter(term != "(Intercept)")
knitr::kable(svy_or, caption = "Survey-weighted odds ratios (per 1 SD)")| term | OR | LCL | UCL | p.value |
|---|---|---|---|---|
| age_c | 3.0292807 | 2.6967690 | 3.4027912 | 0.0000000 |
| bmi_c | 1.8853571 | 1.6526296 | 2.1508579 | 0.0000039 |
| sexFemale | 0.5281132 | 0.4104905 | 0.6794397 | 0.0003857 |
| raceMexican American | 2.0358434 | 1.4850041 | 2.7910081 | 0.0008262 |
| raceOther Hispanic | 1.5915182 | 1.1664529 | 2.1714810 | 0.0087119 |
| raceNH Black | 1.6689718 | 1.1605895 | 2.4000450 | 0.0116773 |
| raceOther/Multi | 2.3270527 | 1.5451752 | 3.5045697 | 0.0014331 |
Interpretation
- Age (age_c) Estimate: 1.10833 (p < 0.0001) For each 1 SD increase in age (if centered & scaled):The log-odds of diabetes increase by 1.108.The odds increase by exp(1.108) ≈ 3.03.
- Older adults have ~3 times higher odds of diabetes per unit increase in age_c. BMI (bmi_c): Estimate: 0.63412 (p < 0.0001) For each 1-unit increase in centered BMI:Odds of diabetes increase by exp(0.634) ≈ 1.88.
- Sex (reference = Male): sexFemale
- Estimate: –0.63844 (p < 0.001) Females have:Lower log-odds of diabetes than males.Odds ratio = exp(–0.638) ≈ 0.53. Females have about 47% lower odds of diabetes compared to males, after adjusting for age, BMI, and race.
- Race (reference = Non-Hispanic White) Mexican American Estimate: 0.71091 OR = exp(0.711) ≈ 2.04 About 2× higher odds of diabetes compared to NH Whites.
- Other Hispanic Estimate: 0.46469 OR = exp(0.465) ≈ 1.59 ~59% higher odds compared to NH Whites.
- Non-Hispanic Black Estimate: 0.51221 OR = exp(0.512) ≈ 1.67 ~67% higher odds compared to NH Whites.
- Other/Multi-racial Estimate: 0.84460 OR = exp(0.845) ≈ 2.33 More than twice the odds compared to NH Whites.
Overall Conclusion
Age and BMI are very strong predictors of diabetes. Females are significantly less likely to have diabetes than males. Several racial/ethnic groups (Mexican American, NH Black, Other/Multi) have significantly higher odds than Non-Hispanic Whites.
All associations remain after adjusting for other variables.
Bayesian Logistic Regression
Bayesian models assume complete data. Multivariate Imputation by Chained Equations (MICE) provides multiple realistic versions of missing data. Pooling across posterior distributions yields the correct full posterior:
Posterior (parameters + missing data uncertainty)
Multivariate Imputation by Chained Equations (Pooled Logistic Regression)
Multiple Imputation (MICE) was applied to impute missing predictors using robust chained‐equation models. Five multiply imputed datasets were created, each representing a plausible version of the complete data.
- MICE Buuren and Groothuis-Oudshoorn (2011), manages missiging data
- Flatness of the density, heavy tails, non-zero peakedness, skewness and multimodality do not hamper performance of multiple imputation for the mean structure (n > 400) even for high percentages (75%) of missing data in one variable Van Buuren and Van Buuren (2012).
- Multiple Imputation (MI) in R package, imputes missing values of one variable at a time, using regression models based on the other variables in the dataset.
- In the chain process, each imputed variable become a predictor for the subsequent imputation, and the entire process is repeated multiple times to create several complete datasets, each reflecting different possibilities for the missing data.
Code
# ----- Multiple Imputation (predictors only)
mi_dat <- adult %>%
dplyr::select(diabetes_dx, age, bmi, sex, race, WTMEC2YR, SDMVPSU, SDMVSTRA)
meth <- mice::make.method(mi_dat)
pred <- mice::make.predictorMatrix(mi_dat)
# Do not impute outcome
meth["diabetes_dx"] <- ""
pred["diabetes_dx", ] <- 0
pred[,"diabetes_dx"] <- 1
# Imputation methods
meth["age"] <- "norm"
meth["bmi"] <- "pmm"
meth["sex"] <- "polyreg"
meth["race"] <- "polyreg"
# Survey design vars as auxiliaries only
meth[c("WTMEC2YR","SDMVPSU","SDMVSTRA")] <- ""
pred[, c("WTMEC2YR","SDMVPSU","SDMVSTRA")] <- 1
imp <- mice::mice(mi_dat, m = 5, method = meth, predictorMatrix = pred, seed = 123)
iter imp variable
1 1 bmi
1 2 bmi
1 3 bmi
1 4 bmi
1 5 bmi
2 1 bmi
2 2 bmi
2 3 bmi
2 4 bmi
2 5 bmi
3 1 bmi
3 2 bmi
3 3 bmi
3 4 bmi
3 5 bmi
4 1 bmi
4 2 bmi
4 3 bmi
4 4 bmi
4 5 bmi
5 1 bmi
5 2 bmi
5 3 bmi
5 4 bmi
5 5 bmi
Imputation was performed on the survey weighted adult data (obs = 5769)
- BMI was imputed using PMM
- m = 5 multiply imputed datasets
- Outcome variable is not impute the outcome
- After imputation, we run the model in each dataset and get the pool of all the estimates.
- Final pooled imputed dataset consisted of 5592 obs
- The iterative process showed stable convergence, indicating reliable estimation of missing BMI values for subsequent Bayesian modeling analyses
Code
glimpse(adult_imp1)Rows: 5,592
Columns: 11
$ diabetes_dx <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ age <dbl> 69, 54, 72, 73, 56, 61, 42, 56, 65, 26, 76, 33, 32, 38, 50…
$ bmi <dbl> 26.7, 28.6, 28.9, 19.7, 41.7, 35.7, 23.6, 26.5, 22.0, 20.3…
$ sex <fct> Male, Male, Male, Female, Male, Female, Male, Female, Male…
$ race <fct> NH Black, NH White, NH White, NH White, Mexican American, …
$ WTMEC2YR <dbl> 13481.04, 24471.77, 57193.29, 65541.87, 25344.99, 61758.65…
$ SDMVPSU <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ SDMVSTRA <dbl> 112, 108, 109, 116, 111, 114, 106, 112, 112, 113, 116, 114…
$ age_c <dbl> 1.13241831, 0.27835981, 1.30323001, 1.36016725, 0.39223428…
$ bmi_c <dbl> -0.33319172, -0.06755778, -0.02561558, -1.31184309, 1.7639…
$ wt_norm <dbl> 0.3393916, 0.6160884, 1.4398681, 1.6500477, 0.6380722, 1.5…
Study population across the three datasets
- Rows: 10175 and Columns: 10 (survey-weighted, merged data)
- Rows: 5,769 and Columns: 12 (filtered Adult data)
- Rows: 5,592 and Columns: 11 (imputed data, adult_imp1)
Bayesian Logistic Regression:
A Bayesian logistic regression model was then fitted to each imputed dataset
- with survey weights-Normalized MEC exam weights (wt_norm) and mean 1.00 (SD 0.79)
- with no missing values
- with continuous variables standardized
- categorical variables re-leveled for reference categoriesand posterior samples were combined across imputations.
Posterior (parameters + missing data uncertainty)
This procedure appropriately propagates uncertainty from both sampling variability and missing data, producing valid Bayesian posterior estimates.
Prior Specification
- Intercept prior: student_t(3, 0, 10) — allowing heavy tails for flexibility in the intercept estimate R. V. D. Schoot et al. (2013)
- Regression coefficients prior: normal(0, 2.5)
- Weakly informative regularization provide gentle regularization, constraining extreme values without overpowering the data R. van de Schoot et al. (2021)
Model Formula
diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race
Family: bernoulli Links:mu = logit Data: adult_imp1 (Number of observations: 5592) Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup draws = 4000 Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS and Tail_ESS are effective sample size measures, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
Code
library(gt)
priors <- c(
set_prior("normal(0, 2.5)", class = "b"),
set_prior("student_t(3, 0, 10)", class = "Intercept")
)
bayes_fit <- brm(
formula = diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race,
data = adult_imp1,
family = bernoulli(link = "logit"),
prior = priors,
chains = 4, iter = 2000, seed = 123,
control = list(adapt_delta = 0.95),
refresh = 0 # quiet Stan output
)Running /opt/R/4.4.2/lib/R/bin/R CMD SHLIB foo.c
using C compiler: ‘gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-2)’
gcc -I"/opt/R/4.4.2/lib/R/include" -DNDEBUG -I"/opt/R/4.4.2/lib/R/library/Rcpp/include/" -I"/opt/R/4.4.2/lib/R/library/RcppEigen/include/" -I"/opt/R/4.4.2/lib/R/library/RcppEigen/include/unsupported" -I"/opt/R/4.4.2/lib/R/library/BH/include" -I"/opt/R/4.4.2/lib/R/library/StanHeaders/include/src/" -I"/opt/R/4.4.2/lib/R/library/StanHeaders/include/" -I"/opt/R/4.4.2/lib/R/library/RcppParallel/include/" -I"/opt/R/4.4.2/lib/R/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DUSE_STANC3 -DSTRICT_R_HEADERS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -D_HAS_AUTO_PTR_ETC=0 -include '/opt/R/4.4.2/lib/R/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -I/usr/local/include -fpic -g -O2 -c foo.c -o foo.o
In file included from /opt/R/4.4.2/lib/R/library/RcppEigen/include/Eigen/Core:19,
from /opt/R/4.4.2/lib/R/library/RcppEigen/include/Eigen/Dense:1,
from /opt/R/4.4.2/lib/R/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22,
from <command-line>:
/opt/R/4.4.2/lib/R/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:679:10: fatal error: cmath: No such file or directory
679 | #include <cmath>
| ^~~~~~~
compilation terminated.
make: *** [/opt/R/4.4.2/lib/R/etc/Makeconf:195: foo.o] Error 1
Code
prior_summary(bayes_fit) prior class coef group resp dpar nlpar lb ub
normal(0, 2.5) b
normal(0, 2.5) b age_c
normal(0, 2.5) b bmi_c
normal(0, 2.5) b raceMexicanAmerican
normal(0, 2.5) b raceNHBlack
normal(0, 2.5) b raceOtherDMulti
normal(0, 2.5) b raceOtherHispanic
normal(0, 2.5) b sexFemale
student_t(3, 0, 10) Intercept
tag source
user
(vectorized)
(vectorized)
(vectorized)
(vectorized)
(vectorized)
(vectorized)
(vectorized)
user
Code
summary(bayes_fit) # Bayesian model summary Family: bernoulli
Links: mu = logit
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race
Data: adult_imp1 (Number of observations: 5592)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.66 0.09 -2.84 -2.50 1.00 4187 3510
age_c 1.09 0.06 0.97 1.22 1.00 3012 3098
bmi_c 0.63 0.05 0.53 0.72 1.00 3472 3315
sexFemale -0.66 0.10 -0.86 -0.46 1.00 4003 3052
raceMexicanAmerican 0.69 0.18 0.35 1.04 1.00 3526 2843
raceOtherHispanic 0.43 0.24 -0.07 0.89 1.00 4058 3114
raceNHBlack 0.54 0.15 0.24 0.82 1.00 3597 3177
raceOtherDMulti 0.82 0.19 0.45 1.19 1.00 3763 3257
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Results and Visualization
MCMC Diagnostics
MCMC Trace Plots (Right Panels)
Four Markov Chain Monte Carlo (MCMC) chains, each with 2000 iterations (50% warm-up), and an adaptive delta of 0.95 ensured good chain convergence and reduced divergent transitions.
Each Trace Plots shows 4 MCMC chains (different colors) across 2000 iterations
The chains mix well and overlap substantially, without visible trends or drifts → indicates good convergence.
The parameter values oscillate around stable means with no systematic pattern → confirms stationarity.
Combined with Rhat ≈ 1 and high ESS from your summary, the trace plots visually validate posterior convergence and independence.
MCMC Convergence: well-mixed chains without trends indicate convergence and stable posterior estimates.
Plots of posterior distributions with high uncertainty, narrow distributions indicating precise estimates.
All distributions look smooth and unimodal → no multimodality, confirming stable posteriors.
Each histogram represents the distribution of sampled coefficient values after convergence across all MCMC draws
b_raceOtherHispanic: The posterior peaks around 0.4–0.5, with some spread below 0 and above 1.→ Suggests a modestly positive association with diabetes risk, but some uncertainty (credible interval overlaps 0).
b_raceNHBlack: Centered around 0.5–0.6, with a narrower, symmetric shape. → Indicates a consistent positive effect—NH Black participants have higher odds of diabetes, and uncertainty is low.
b_raceOtherDMulti: Centered around 0.8–0.9, with slightly wider spread but entirely above 0. → Stronger evidence for increased odds of diabetes among Other/Multi-racial individuals.
Summary stats of the selected parameters from posterior draws
Code
# Posterior ORs (drop intercept, clean labels)
bayes_or <- posterior_summary(bayes_fit, pars = "^b_") %>%
as.data.frame() %>%
tibble::rownames_to_column("raw") %>%
dplyr::mutate(
term = gsub("^b_", "", raw),
term = gsub("race", "race:", term),
term = gsub("sex", "sex:", term),
term = gsub("OtherDMulti", "Other/Multi", term),
term = gsub("OtherHispanic", "Other Hispanic", term),
OR = exp(Estimate),
LCL = exp(Q2.5),
UCL = exp(Q97.5)
) %>%
dplyr::select(term, OR, LCL, UCL) %>%
dplyr::filter(term != "Intercept")
knitr::kable(
bayes_or %>%
dplyr::mutate(dplyr::across(c(OR,LCL,UCL), ~round(.x, 2))),
digits = 2,
caption = "Bayesian posterior odds ratios (95% CrI) — reference: NH White (race), Male (sex)"
)| term | OR | LCL | UCL |
|---|---|---|---|
| age_c | 2.99 | 2.64 | 3.37 |
| bmi_c | 1.87 | 1.71 | 2.05 |
| sex:Female | 0.52 | 0.42 | 0.63 |
| race:MexicanAmerican | 2.00 | 1.41 | 2.84 |
| race:Other Hispanic | 1.54 | 0.93 | 2.43 |
| race:NHBlack | 1.71 | 1.28 | 2.27 |
| race:Other/Multi | 2.27 | 1.56 | 3.28 |
Posterior Draws
Using 4,000 posterior draws from the Bayesian model (4 chains × 1,000 post-warmup draws per chain), the mcmc_areas visualized the posterior distributions of key predictors: age, BMI, sex, and race/ethnicity.
The posterior densities show the range and uncertainty of each coefficient and the 95% credible intervals depicted by the shaded areas, highlighting which predictors have strong evidence of association with diabetes.
Reference: Male , Non-Hispanic White. Age and BMI are standardized (per 1 SD).* - represent the central tendency and uncertainty around the model parameters through credible intervals (CrI).
The Bayesian logistic regression model identified significant associations between demographic and anthropometric factors and diabetes diagnosis.
- Bayesian posterior odds ratios (95% CrI) with reference group being: NH White, Male, corresponds to very low probability of diabetes (~7%).
- Age a strong predictor: For each 1 SD increase in age, the odds of diabetes are about 3 times higher, with high certainty (credible interval well above 0) and higher odds of diabetes (OR = 2.99; 95% CrI = 2.64–3.37).
- BMI showed a strong positive association (OR = 1.87; 95% CrI = 1.71–2.05), higher body mass substantially increased diabetes risk.
- Female sex had lower odds of diabetes compared to males (OR = 0.52; 95% CrI = 0.42–0.63).
- Compared with Non-Hispanic Whites (reference group), several racial/ethnic groups had higher odds:
- Mexican Americans (OR = 2.00; 95% CrI = 1.41–2.84)
- Non-Hispanic Blacks (OR = 1.71; 95% CrI = 1.28–2.27)
- Other/Multi-racial individuals (OR = 2.27; 95% CrI = 1.56–3.28)
- Other Hispanics showed a positive but non-significant association (OR = 1.54; 95% CrI = 0.93–2.43).
Below is the Summary of all parameters
Code
library(brms)
library(dplyr)
library(tibble)
# Extract posterior draws
post <- posterior_summary(bayes_fit, robust = FALSE) %>%
as_tibble(rownames = "term")
# Extract convergence diagnostics
diag <- rstan::monitor(as.array(bayes_fit$fit), print = FALSE) %>%
as_tibble(rownames = "term") %>%
select(term, Rhat, n_eff = Bulk_ESS)
# Combine summaries + diagnostics
bayes_results <- post %>%
left_join(diag, by = "term") %>%
mutate(
OR = exp(Estimate),
OR_LCL = exp(Q2.5),
OR_UCL = exp(Q97.5)
) %>%
select(
term,
Estimate,
Est.Error,
Q2.5,
Q97.5,
Rhat,
n_eff,
OR,
OR_LCL,
OR_UCL
)
bayes_results# A tibble: 11 × 10
term Estimate Est.Error Q2.5 Q97.5 Rhat n_eff OR OR_LCL
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 b_Intercept -2.66e+0 0.0884 -2.84e+0 -2.50e+0 1.00 2141 6.96e-2 5.84e-2
2 b_age_c 1.09e+0 0.0622 9.72e-1 1.22e+0 1.00 1575 2.99e+0 2.64e+0
3 b_bmi_c 6.27e-1 0.0476 5.34e-1 7.20e-1 1.000 1870 1.87e+0 1.71e+0
4 b_sexFemale -6.59e-1 0.102 -8.59e-1 -4.58e-1 1.01 2143 5.18e-1 4.23e-1
5 b_raceMexic… 6.92e-1 0.177 3.47e-1 1.04e+0 1.00 1741 2.00e+0 1.41e+0
6 b_raceOther… 4.31e-1 0.244 -7.16e-2 8.87e-1 1.01 2147 1.54e+0 9.31e-1
7 b_raceNHBla… 5.38e-1 0.147 2.43e-1 8.18e-1 1.00 1685 1.71e+0 1.28e+0
8 b_raceOther… 8.19e-1 0.189 4.45e-1 1.19e+0 1.00 1915 2.27e+0 1.56e+0
9 Intercept -2.67e+0 0.0678 -2.81e+0 -2.55e+0 1.00 1525 6.90e-2 6.03e-2
10 lprior -1.65e+1 0.0531 -1.66e+1 -1.64e+1 1.00 1304 6.81e-8 6.07e-8
11 lp__ -1.43e+3 2.04 -1.44e+3 -1.43e+3 1.00 865 0 0
# ℹ 1 more variable: OR_UCL <dbl>
Code
summary(bayes_fit) Family: bernoulli
Links: mu = logit
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race
Data: adult_imp1 (Number of observations: 5592)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.66 0.09 -2.84 -2.50 1.00 4187 3510
age_c 1.09 0.06 0.97 1.22 1.00 3012 3098
bmi_c 0.63 0.05 0.53 0.72 1.00 3472 3315
sexFemale -0.66 0.10 -0.86 -0.46 1.00 4003 3052
raceMexicanAmerican 0.69 0.18 0.35 1.04 1.00 3526 2843
raceOtherHispanic 0.43 0.24 -0.07 0.89 1.00 4058 3114
raceNHBlack 0.54 0.15 0.24 0.82 1.00 3597 3177
raceOtherDMulti 0.82 0.19 0.45 1.19 1.00 3763 3257
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
library(brms)
plot(bayes_fit) # Posterior distributionsPosterior Coefficients (log-odds scale)
Intercept: b_Intercept = –2.66 (SD = 0.088, LCL ≈ –2.84)
- Baseline log-odds of diabetes for: male, NH White, average age, average BMI with ~ 6.5% baseline diabetes probability.
Age (per 1 SD): b_age_c = 1.094 (SD = 0.062, LCL ≈ 0.97)
- Very strong positive effect.
- Interpreting as odds ratio: OR ≈ exp(1.094) ≈ 2.99 A 1-SD increase in age almost triples the odds of diabetes.
- 95% credible interval excludes 0 → strong evidence of association.
BMI (per 1 SD): b_bmi_c = 0.627 (SD = 0.048, LCL ≈ 0.53)
- Also strong positive effect.
- OR ≈ exp(0.627) ≈ 1.87
- A 1-SD BMI increase raises the odds of diabetes by ~87%.
- Very tight credible interval → high certainty.
Sex (Female) b_sexFemale = –0.659 (SD = 0.102, LCL ≈ –0.859)
- Women have lower odds of diabetes than men (Reference = Male)
- OR ≈ exp(–0.659) ≈ 0.52
- Females have ~48% lower odds compared with males, holding age, BMI, race constant.
Race
Mexican American: b_raceMexicanAmerican = 0.692 (SD = 0.177, LCL ≈ 0.347)
- Increased odds of diabetes.
- OR ≈ exp(0.692) ≈ 1.99
- Nearly double the odds compared to NH Whites.
Other Hispanic: b_raceOtherHispanic = 0.431 (SD = 0.244, LCL ≈ –0.072)
- Positive mean, but lower bound crosses 0.
- OR ≈ exp(0.431) ≈ 1.54
- Evidence is weaker that “Other Hispanic” have higher odds than NH Whites.
NH Black: b_raceNHBlack = 0.538 (SD = 0.147, LCL ≈ 0.243)
- Strong positive association.
- OR ≈ exp(0.538) ≈ 1.71
- NH Black individuals have ~70% higher odds vs NH Whites.
Other / Multiracial: b_raceOtherDMulti = 0.819 (SD = 0.189, LCL ≈ 0.445)
Strong evidence of increased risk.
OR ≈ exp(0.819) ≈ 2.27
Over twice the odds compared to NH Whites
Interpretation
- Age and BMI showed positive associations
- BMI showed a negative association in most draws, with posterior estimates ranging roughly from 0.61 to 0.70, indicating that higher BMI is associated with lower odds of the outcome in this analysis
- Age showed a positive association, with posterior estimates ranging roughly from 1.00 to 1.14, suggesting that higher age increases the odds of the outcome.
- Female sex showed a negative association
- Racial/ethnic groups had elevated odds relative to the reference group
Prior vs Posterior Examination
Code
library(brms)
library(dplyr)
library(posterior)
library(bayesplot)
library(ggplot2)
post_wide <- as_draws_df(bayes_fit) # or posterior_samples(bayes_fit)
names(post_wide) # should show: b_Intercept, b_bmi_c, b_age_c, etc. [1] "b_Intercept" "b_age_c" "b_bmi_c"
[4] "b_sexFemale" "b_raceMexicanAmerican" "b_raceOtherHispanic"
[7] "b_raceNHBlack" "b_raceOtherDMulti" "Intercept"
[10] "lprior" "lp__" ".chain"
[13] ".iteration" ".draw"
Code
mcmc_areas(
post_wide,
pars = c("b_age_c","b_bmi_c","b_sexFemale",
"b_raceMexicanAmerican","b_raceOtherHispanic",
"b_raceNHBlack","b_raceOtherDMulti")
)The posterior distributions represents density curve with the uncertainty around the parameter’s posterior mean for age and BMI
Age and BM: strong positive associations with diabetes status, as their posterior distributions are concentrated above zero, suggesting higher values increase the likelihood of diabetes.
Sex (Female) has a distribution centered slightly below zero, indicating a lower probability of diabetes compared to males.
Race categories (Mexican American, Other Hispanic, Non-Hispanic Black, and Other/Multi) show broader distributions with varying levels of uncertainty relative to the reference group (Non-Hispanic White).
The shaded regions indicate the 80% credible intervalswithin which the true parameter values are most likely to lie based on the posterior samples.
Posterior Predictive Checking (PPC)
Code
library(posterior)
library(dplyr)
library(tidyr)
# Extract posterior draws as a matrix, then convert to tibble
post <- as_draws_matrix(bayes_fit) %>% # safer than as_draws_df for manipulation
as.data.frame() %>%
select(b_bmi_c, b_age_c) %>%
pivot_longer(
everything(),
names_to = "term",
values_to = "estimate"
) %>%
mutate(
term = case_when(
term == "b_bmi_c" ~ "BMI (per 1 SD)",
term == "b_age_c" ~ "Age (per 1 SD)"
),
type = "Posterior"
)
prior_draws <- tibble(
term = rep(c("BMI (per 1 SD)", "Age (per 1 SD)"), each = 4000),
estimate = c(rnorm(4000, 0, 1), rnorm(4000, 0, 1)),
type = "Prior"
)
combined_draws <- bind_rows(prior_draws, post)Code
library(ggplot2)
ggplot(combined_draws, aes(x = estimate, fill = type)) +
geom_density(alpha = 0.4) +
facet_wrap(~ term, scales = "free", ncol = 2) +
theme_minimal(base_size = 13) +
labs(
title = "Prior vs Posterior Distributions",
x = "Coefficient estimate",
y = "Density",
fill = ""
)Prior vs Posterior Distributions :
Density plots show an overlay of the prior and posterior distributions for BMI and age
Posterior distributions were narrower and shifted away from zero relative to the priors, indicating that the data provided substantial evidence for positive associations of BMI and age with diabetes risk.
The shift from diffuse priors to concentrated posteriors demonstrates the model’s ability to incorporate empirical evidence and refine prior beliefs.
Narrower posterior distributions reflect reduced uncertainty in parameter estimates.
Posterior Predictive Checks (PPC)
We generates 500 predicted outcomes for each observation using samples from the posterior distribution (ndraws = 500) to simulate predicted data to compare these simulated outcomes with the actual observed data to see if the model can reasonably reproduce the patterns in the data
- Predicted outcomes: binary outcome with 0 or 1 simulated
- Presented below are the rows (500) and columns = number of observations (5592)
Code
# Posterior predictive draws
#Posterior predictive checks (binary outcome)
pp_samples <- posterior_predict(bayes_fit, ndraws = 500) # 500 draws
# Check dimensions
dim(pp_samples) # rows = draws, cols = observations[1] 500 5592
Code
ppc_bars(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:500, ])ppc_bars() plot - plot compared the observed and predicted diabetes outcomes - Model’s predictions align with reality where mean(y_rep) = average predicted probability of diabetes for each individual, across all posterior draws of the parameters and y = the actual observed diabetes status (0 = non-diabetic, 1 = diabetic).
Posterior Predictive stats: Mean
Code
#PP check for proportions (useful for binary) mean comparison to check if the simulated means match the observed mean
## mean
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "mean") +
labs(title = "Posterior Predictive Check: Mean of Replicates") +
theme_minimal()- A posterior predictive check using 100 replicated datasets from the posterior distribution shows the mean diabetes outcome closely aligned with the observed mean, suggesting that the Bayesian model accurately captures the centraltendency of the outcome.
Posterior Predictive stats: Standard Deviation
Code
#PP check for proportions (useful for binary) mean and sd comparison to check if the simulated means match the observed mean
## sd
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "sd") +
labs(title = "PPC: Standard Deviation of Replicates") +
theme_minimal()- A posterior predictive check on the standard deviation of diabetes outcomes using 100 replicated datasets from the posterior distribution closely matched the observed value, indicating that the Bayesian model adequately captures the variability in the outcome data.
Observed vs. Predicted Average diabetes outcome across individuals
Code
# PP checks with bayesplot options
color_scheme_set("blue")
ppc_scatter_avg(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ]) +
labs(title = "Observed vs Predicted (Avg) Posterior Predictive at individual level")The plot shows predicted probabilities of observed outcomes, across individuals and checks model calibration Logistic regression fit at person-level.
Each point = one observation (an individual).
x-axis = observed value (
0or1in your binary diabetes variable).y-axis = average predicted posterior probability for that same individual across simulated datasets.
Points near (0, 0) or (1, 1) → good prediction.
Observed vs. Predicted estimates for BMI
Code
predicted <- fitted(bayes_fit, summary = TRUE)
observed <- adult_imp1[, c("bmi", "age")]
# Plot for **bmi** (obs vs pred)
ggplot(data = NULL, aes(x = observed$bmi, y = predicted[, "Estimate"])) +
geom_point() +
geom_errorbar(aes(ymin = predicted[, "Q2.5"], ymax = predicted[, "Q97.5"])) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
xlab("Observed bmi") + ylab("Predicted bmi")- Comparison of observed vs and posterior predicted estimates for BMI
- Each point represents an individual’s observed BMI versus the model’s predicted mean.
- Error bars indicate the 95% credible intervals of the predictions.
- The plot demonstrates that the model’s predictions generally align with the observed data, with most points closely following the diagonal, indicating good predictive performance for BMI.
Model-Level Outcomes
Code
# Results
fmt_or <- function(or, lcl, ucl, digits = 2) {
paste0(
formatC(or, format = "f", digits = digits), " (",
formatC(lcl, format = "f", digits = digits), "–",
formatC(ucl, format = "f", digits = digits), ")"
)
}
svy_tbl <- svy_or %>% mutate(Model = "Survey-weighted MLE")
bayes_tbl <- bayes_or %>% mutate(Model = "Bayesian")
# Remove mi_tbl from bind_rows
all_tbl <- bind_rows(svy_tbl, bayes_tbl) %>%
mutate(term = case_when(
str_detect(term, "bmi_c|\\bBMI\\b") ~ "BMI (per 1 SD)",
str_detect(term, "age_c|\\bAge\\b") ~ "Age (per 1 SD)",
TRUE ~ term
)) %>%
filter(term %in% c("BMI (per 1 SD)", "Age (per 1 SD)")) %>%
mutate(OR_CI = fmt_or(OR, LCL, UCL, digits = 2)) %>%
select(Model, term, OR_CI) %>%
arrange(
factor(Model, levels = c("Survey-weighted MLE","Bayesian Regression")),
factor(term, levels = c("BMI (per 1 SD)","Age (per 1 SD)"))
)
res_wide <- all_tbl %>%
pivot_wider(names_from = term, values_from = OR_CI) %>%
rename(
`BMI (per 1 SD) OR (95% CI)` = `BMI (per 1 SD)`,
`Age (per 1 SD) OR (95% CI)` = `Age (per 1 SD)`
)
kable(
res_wide,
align = c("l","c","c"),
caption = "Odds ratios (per 1 SD) with 95% CIs across models"
)| Model | BMI (per 1 SD) OR (95% CI) | Age (per 1 SD) OR (95% CI) |
|---|---|---|
| Survey-weighted MLE | 1.89 (1.65–2.15) | 3.03 (2.70–3.40) |
| Bayesian | 1.87 (1.71–2.05) | 2.99 (2.64–3.37) |
Code
# Compute proportion of diabetes=1 for each draw
pp_proportion <- rowMeans(pp_samples) # proportion of 1's in each posterior draw
# Optional: visualize the posterior probability distribution
pp_proportion_df <- tibble(proportion = pp_proportion)
ggplot(pp_proportion_df, aes(x = proportion)) +
geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
labs(
title = "Posterior Distribution of Proportion of Diabetes = 1",
x = "Proportion of Diabetes = 1",
y = "Frequency"
) +
theme_minimal()Model Comparison Across methods
- survey-weighted maximum likelihood estimation (MLE)
- Bayesian regression
For BMI
- odds ratios ranged from 1.89 (95% CI: 1.65–2.15) in survey-weighted model
- Odds rato of 1.87 (95% CrI: 1.71–2.05) in the Bayesian model
For age
- Odds ration 3.03 (95% CI: 2.70–3.40) from the survey-weighted MLE model
- Odds ratio 2.99 (95% CrI: 2.64–3.37) in the Bayesian analysis ### Posterior Distribution of Diabetes Prevalence
A histogram of these values shows the distribution of predicted prevalence of diabetes calculated for each draw of the Bayesian model
reflecting uncertainty in the model estimates, central tendency and variability of diabetes prevalence
Most posterior predictions cluster around 10–11%, indicating good alignment with the observed/imputed data and demonstrating that the model captures the underlying population pattern.
Code
library(dplyr)
library(knitr)
svy_mean <- svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
# Create summary table WITHOUT MICE row
summary_table <- tibble(
Method = c("Survey-weighted mean (NHANES)",
"Posterior predictive mean"),
diabetes_mean = c(
coef(svy_mean), # survey-weighted mean
mean(pp_proportion) # posterior predictive mean
),
SE = c(
SE(svy_mean), # survey-weighted SE
NA # no SE for posterior predictive
)
)
# Render table
kable(summary_table, digits = 4, caption = "Comparison of Diabetes Prevalence Across Methods")| Method | diabetes_mean | SE |
|---|---|---|
| Survey-weighted mean (NHANES) | 0.0890 | 0.0048 |
| Posterior predictive mean | 0.1091 | NA |
Posterior Predicted Diabetes Proportion vs. NHANES Prevalence
Code
library(tidyverse)
# Posterior predicted proportion vector
# pp_proportion <- rowMeans(pp_samples) # if not already done
known_prev <- 0.089 # NHANES prevalence
# Posterior summary
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975)) # 95% credible interval
# Create a data frame for plotting
pp_df <- tibble(proportion = pp_proportion)
# Plot
ggplot(pp_df, aes(x = proportion)) +
geom_histogram(binwidth = 0.005, fill = "skyblue", color = "black") +
geom_vline(xintercept = known_prev, color = "red", linetype = "dashed", size = 1) +
geom_vline(xintercept = posterior_mean, color = "blue", linetype = "solid", size = 1) +
labs(
title = "Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
subtitle = paste0("Red dashed = NHANES prevalence (", known_prev,
"), Blue solid = Posterior mean (", round(posterior_mean,3), ")"),
x = "Proportion of Diabetes = 1",
y = "Frequency"
) +
theme_minimal()- We compared the Bayesian posterior-predicted diabetes prevalence with the NHANES survey-weighted estimate (8.9%, SE = 0.0048).
- The model’s posterior mean was 10.95%, with a 95% credible interval of 8.5%–12.8%. Although the model predicts slightly higher prevalence, the credible interval overlaps the NHANES estimate.
- This indicates that the model is reasonably well-calibrated, with NHANES falling near the lower end of the posterior distribution but still within a plausible range.
Code
library(dplyr)
# Posterior predicted proportion
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975)) # 95% credible interval
# NHANES prevalence with SE from survey::svymean
# Suppose you already have:
# svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
known_prev <- 0.089 # Mean prevalence
known_se <- 0.0048 # Standard error from survey
# Calculate 95% confidence interval
known_ci <- c(
known_prev - 1.96 * known_se,
known_prev + 1.96 * known_se
)
# Print results
data.frame(
Type = c("Posterior Prediction", "NHANES Prevalence"),
Mean = c(posterior_mean, known_prev),
Lower_95 = c(posterior_ci[1], known_ci[1]),
Upper_95 = c(posterior_ci[2], known_ci[2])
) Type Mean Lower_95 Upper_95
2.5% Posterior Prediction 0.1091237 0.09826091 0.1205293
NHANES Prevalence 0.0890000 0.07959200 0.0984080
Code
# Create a data frame for plotting
ci_df <- data.frame(
Type = c("Posterior Prediction", "NHANES Prevalence"),
Mean = c(0.1096674, 0.089),
Lower_95 = c(0.09772443, 0.079592),
Upper_95 = c(0.1210658, 0.098408)
)
# --- 1. Survey-weighted (Population) prevalence ---
pop_est <- svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
pop_prev <- as.numeric(pop_est)
pop_se <- as.numeric(SE(pop_est))
pop_ci <- c(pop_prev - 1.96 * pop_se, pop_prev + 1.96 * pop_se)
# --- 2. Bayesian posterior prevalence ---
# bayes_pred = matrix of posterior draws (iterations × individuals)
pp_proportion <- rowMeans(pp_samples) # prevalence per posterior draw
post_prev <- mean(pp_proportion) # posterior mean prevalence
post_ci <- quantile(pp_proportion, c(0.025, 0.975)) # 95% credible interval
# --- 3. Combine into one data frame ---
bar_df <- tibble(
Source = c("Survey-weighted (Population)", "Bayesian Posterior"),
Prevalence = c(pop_prev, post_prev),
CI_low = c(pop_ci[1], post_ci[1]),
CI_high = c(pop_ci[2], post_ci[2])
)
# --- 4. Plot ---
ggplot(bar_df, aes(x = Source, y = Prevalence, fill = Source)) +
geom_col(alpha = 0.85, width = 0.6) +
geom_errorbar(
aes(ymin = CI_low, ymax = CI_high),
width = 0.15,
color = "black",
linewidth = 0.8
) +
guides(fill = "none") +
labs(
title = "Population vs Posterior Diabetes Prevalence",
subtitle = "Survey-weighted estimate (design-based) vs Bayesian (model-based)",
y = "Prevalence (Proportion with Diabetes)",
x = NULL
) +
theme_minimal(base_size = 13)Comparison of Diabetes Prevalence Across Methods
The bar plot compares survey-weighted and Bayesian posterior estimates of diabetes prevalence:
- Survey-weighted prevalence: 8.9% (95% CI: 0.080–0.098)
- Bayesian posterior mean: 10.9% (95% CrI: 0.098–0.121)
- The posterior credible interval overlaps the survey 95% CI, indicating that the Bayesian model reproduces population-level prevalence accurately.
- This demonstrates good model calibration and predictive validity, while visualizing the uncertainty of both survey-based and model-based estimates.
Model Assumptions for Bayesian Logistic Regression
- Data & Observations
- The outcome is binary (0/1).
- Observations are independent, conditional on predictors.
- Predictors are measured without structural errors (no deterministic misclassification).
- Model Structure
- The relationship between predictors and the log-odds of the outcome is linear in the logit scale
- No perfect multicollinearity among predictors.
- No complete or quasi-complete separation, which would create infinite log-odds estimates.
- Priors
- Priors are proper (integrate to 1)
- Priors are informative enough to regularize estimation—especially important for logistic models with sparse data.
- Posterior Behavior
Posterior distribution is proper (i.e., well-defined).
MCMC shows good convergence:
R^≈1 1R^≈1
No divergent transitions
Effective sample sizes are adequate
Trace plots show good mixing
- Model Fit
- Posterior predictive checks show that the model can reasonably reproduce key features of the observed data.
- Predicted probabilities are well-calibrated.
- Residuals and diagnostics do not reveal major misfit.
MCMC Autocorrelation for age and BMI coefficients
Code
library(tidyr)
library(bayesplot)
library(posterior)
# Convert fitted model to draws array
post_array <- as_draws_array(bayes_fit) # draws x chains x parameters
# Plot autocorrelation for age and bmi
mcmc_acf(post_array, pars = c("b_age_c", "b_bmi_c"))Autocorrelation plots are pairwise plots (mcmc_pairs, posterior) explore correlations between parameters (age and BMI) coefficients - assess chain mixing and convergence showing correlation of each draw with its lagged values across iterations. - Rapid decay of autocorrelation toward zero indicates that the Markov chains are mixing well and successive draws are relatively independent. - Both age and BMI coefficients exhibited low autocorrelation after a few lags, supporting the reliability of posterior estimates. - This diagnostic confirms that the Bayesian model sampling was adequate and stable, ensuring valid inference from the posterior distributions. - Autocorrelation within MCMC samples for a single parameter across iterations — i.e., how strongly each sample depends on its previous ones. - It reflects MCMC chain mixing and efficiency, not relationships between parameters
Correlation Matrix
Code
library(bayesplot)
# Extract posterior draws
post <- as_draws_df(bayes_fit)
# Select numeric parameters of interest
post_subset <- post %>%
dplyr::select(b_age_c, b_bmi_c, b_sexFemale,
b_raceMexicanAmerican, b_raceNHBlack)
# Compute correlation matrix
cor_matrix <- cor(post_subset)
# Visualize
mcmc_pairs(as_draws_array(bayes_fit),
pars = c("b_age_c", "b_bmi_c", "b_sexFemale"),
off_diag_args = list(size = 1.5, alpha = 0.4))Interpretation of Posterior correlation matrix among parameters (e.g., between b_age_c, b_bmi_c, b_sexFemale)
Each diagonal panel shows the posterior density (distribution) of one parameter, while the off-diagonal scatterplots show pairwise relationships (correlations) between parameters across posterior draws.
- Marginal Distributions (diagonals)
- The histograms show that all parameters (
b_age_c,b_bmi_c,b_sexFemale) have approximately symmetric and unimodal posterior distributions. - This indicates stable and well-identified estimates with no sampling irregularities or multimodality.
- The narrow spread (small variance) reflects high precision of parameter estimates.
- Joint Distributions (off-diagonal scatterplots)
- The scatterplots show elliptical clouds centered around the mean, with no strong linear patterns. This means low posterior correlation among parameters — i.e.,
b_age_c,b_bmi_c, andb_sexFemaleare largely independent in their posterior estimates. - The absence of diagonal streaks or skewed clusters suggests that collinearity is minimal and that the MCMC chains successfully explored the parameter space.
- Subtle Correlation Notes**
- A mild negative tilt between
b_age_candb_sexFemaleindicates a slight negative correlation, meaning as the posterior estimate for age increases, the effect of being female slightly decreases. - However, this pattern is weak — confirming that both predictors contribute distinct information to the diabetes outcome.
- The absence of strong linear relationships among parameters suggests that age, BMI, and sex independently contributed to the prediction of diabetes status.
- The smooth, unimodal histograms along the diagonals confirmed stable model convergence and well-behaved posterior samples.
Code
bayes_R2(bayes_fit) # Model fit Estimate Est.Error Q2.5 Q97.5
R2 0.1313342 0.01265055 0.1064607 0.156078
Bayesian 𝑅^2 (model fit statics):
Model Fit to quantify predictive performance. - Explains about 13% of the variability in diabetes status, with credible uncertainty bounds suggesting reasonable but modest explanatory power. - but other factors (e.g., genetics, lifestyle, environment) also contribute to outcome variability.
Comparison of Odds Ratios Across Models
Code
# Combine results from remaining models (Survey-weighted and Bayesian)
svy_tbl <- if (exists("svy_or") && nrow(svy_or) > 0)
mutate(svy_or, Model = "Survey-weighted (MLE)") else NULL
bayes_tbl <- if (exists("bayes_or") && nrow(bayes_or) > 0)
mutate(bayes_or, Model = "Bayesian") %>%
filter(term != "Intercept") else NULL
# Long format
all_tbl <- bind_rows(Filter(Negate(is.null), list(svy_tbl, bayes_tbl))) %>%
mutate(
term = case_when(
grepl("bmi", term, ignore.case = TRUE) ~ "BMI (per 1 SD)",
grepl("age", term, ignore.case = TRUE) ~ "Age (per 1 SD)",
grepl("^sexFemale$", term) ~ "Female (vs. Male)",
grepl("^sexMale$", term) ~ "Male (vs. Female)",
grepl("^raceHispanic$", term) ~ "Hispanic (vs. White)",
grepl("^raceBlack$", term) ~ "Black (vs. White)",
grepl("^raceOther$", term) ~ "Other (vs. White)",
TRUE ~ term
),
OR_CI = sprintf("%.2f (%.2f – %.2f)", OR, LCL, UCL)
) %>%
select(Model, term, OR_CI)
# ➜ WIDE TABLE: Predictors as rows, Models as columns
vertical_tbl <- all_tbl %>%
pivot_wider(
names_from = Model,
values_from = OR_CI
) %>%
arrange(term)
# Print as nice table
kable(vertical_tbl,
align = "lcc",
caption = "Odds Ratios (95% CI) Across Models")| term | Survey-weighted (MLE) | Bayesian |
|---|---|---|
| Age (per 1 SD) | 3.03 (2.70 – 3.40) | 2.99 (2.64 – 3.37) |
| BMI (per 1 SD) | 1.89 (1.65 – 2.15) | 1.87 (1.71 – 2.05) |
| Female (vs. Male) | 0.53 (0.41 – 0.68) | NA |
| race:MexicanAmerican | NA | 2.00 (1.41 – 2.84) |
| race:NHBlack | NA | 1.71 (1.28 – 2.27) |
| race:Other Hispanic | NA | 1.54 (0.93 – 2.43) |
| race:Other/Multi | NA | 2.27 (1.56 – 3.28) |
| raceMexican American | 2.04 (1.49 – 2.79) | NA |
| raceNH Black | 1.67 (1.16 – 2.40) | NA |
| raceOther Hispanic | 1.59 (1.17 – 2.17) | NA |
| raceOther/Multi | 2.33 (1.55 – 3.50) | NA |
| sex:Female | NA | 0.52 (0.42 – 0.63) |
Conclusion
Model Performance and Fit
- Across multiple modeling approaches — survey-weighted MLE and Bayesian logistic regression — the models were consistent and reliable in predicting diabetes risk.
- The Bayesian model showed excellent convergence (Rhat = 1.00; Bulk/Tail ESS > 2000) and explained approximately 13% of the variance in diabetes (Posterior R² = 0.13, 95% CrI: 0.106–0.156).
- Posterior predictive checks confirmed good model calibration, with credible intervals capturing predictive uncertainty.
Key Predictors
- Age and BMI were the strongest and most consistent predictors across all methods. Each 1 SD increase in age nearly tripled the odds of diabetes, while a 1 SD increase in BMI increased the odds by approximately 1.7–1.9×.
- Sex (female) was associated with lower odds of diabetes.
- Race/Ethnicity: Mexican American, NH Black, and Other/Multi groups had significantly higher odds; the effect for Other Hispanic was less certain.
Interpretation and Robustness
- Bayesian credible intervals were slightly narrower than frequentist confidence intervals but largely overlapped, indicating robust effect estimates and stable inference.
- Prior regularization in the Bayesian model stabilized estimates in the presence of modest missingness.
- The results are slightly conservative relative to the observed population prevalence, but the posterior predictions remain consistent with both survey-weighted and imputed data.
Design Considerations
- Survey-weighted MLE accounts for the complex NHANES sampling design, producing population-representative estimates.
- The Bayesian model used normalized weights as importance weights, which approximates the effect of survey weights but does not fully account for stratification, clustering, or design-based variance adjustments.
Overall: - Age and BMI are consistently strong risk factors for diabetes in this population. - The Bayesian model complements frequentist approaches by providing stable, interpretable, and uncertainty-quantified estimates, while broadly reproducing population-level prevalence.
Discussions
- The use of multiple imputation allowed for robust analysis despite missing data, increasing power and reducing bias.
- Comparison of frequentist and Bayesian models demonstrated consistency in significant predictors, while Bayesian approaches provided the advantage of posterior distributions and probabilistic interpretation.
- Across all models, both age and BMI emerged as strong and consistent predictors of diabetes.
- The consistency across modeling approaches strengthens the validity of these findings Multiple imputation accounted for potential biases due to missing data, and Bayesian modeling provided robust credible intervals that closely matched frequentist estimates align with previous epidemiological research indicating that increasing age and higher BMI are among the most important determinants of type 2 diabetes risk.
- Cumulative exposure to metabolic and lifestyle risk factors over time, and the role of excess adiposity and insulin related effects account for diabetes.
- Survey weighted dataset strenghthens ensuring population representativeness, multiple imputation to handle missing data, and rigorous Bayesian estimation provided high effective sample sizes and R̂ ≈ 1.00 across parameters confirmed excellent model convergence.
- Bayesian logistic regression provided inference statistically consistent and interpretable achieving the aim of this study. In future research hierarchical model using NHANES cycles or adding variables (lab tests) could assess nonlinear effects of metabolic risk factors.
Limitations
- The study is based on cross-sectional/observational NHANES data, which limits the ability to make causal inferences. Associations observed between BMI, age, diabetes status cannot confirm causation.
- The project relies on multiple imputation for missing values, even though imputation reduces bias, it assumes missingness is at random (MAR); if data are missing not at random (MNAR), results may be biased.
- Potential Residual Confounding - Models included key predictors (age, BMI, sex, race), but unmeasured factors like diet, physical activity, socioeconomic status, or genetic predisposition could confound associations.
- Bayesian models depend on prior choices, which could influence posterior estimates if priors are informative or mis-specified.
- Variable Measurement - BMI is measured at a single time point, which may not reflect long-term exposure or risk.
- Self-reported variables - are subjective to recall or reporting bias.
- Interactions are not tested in the study (bmi × age) and so other potential synergistic effects might be missed.
- Predicted probabilities are model-based estimates, not validated for clinical decision-making. External validation in independent cohorts is needed before application.
Targeted Therapy
- Translational Perspective from the Bayesian Diabetes Prediction Project. This project further demonstrates the translational potential of Bayesian modeling in clinical decision-making and public health strategy.
- By using patient-level predictors such as age, BMI, sex, and race to estimate the probability of diabetes, the model moves beyond descriptive statistics toward individualized risk prediction.
- The translational move lies in converting these probabilistic outputs into actionable thresholds—such as identifying the BMI or age at which the predicted risk of diabetes exceeds a clinically meaningful level (e.g., 30%).
- Such insights can guide early screening, personalized lifestyle interventions, and targeted prevention programs for populations at higher risk.
- This approach embodies precision public health—bridging data science and medical decision-making to deliver tailored, evidence-based strategies that can ultimately improve diabetes prevention and management outcomes.
What changes in modifiable predictors would lower diabetes risk?
Translational Research Implications:
- We can use the model to guide prevention or intervention.
- Only BMI is a modifiable risk factor
- We can make changes in BMI (behavior or lifestyle) to achieve a lower risk threshold
- we hold non modifiable predictors as constant (sex, race).
- Vary modifiable predictors (BMI) until the model predicts the desired probability.
Validation
Internal validation
- To illustrate personalized risk estimation using the Bayesian model, we computed the posterior predicted probability of diabetes for a representative participant.
- We selected one participant from the dataset (adult[1, ]) including all relevant covariates (age, BMI, sex, race).
- Used posterior_linpred with transform = TRUE to obtain predicted probabilities for logistic regression.
- Extracted posterior draws computed 95% credible interval from the posterior draws.
- Density plot shows the distribution of plausible probabilities given the participant’s covariates.
- The density highlights uncertainty around the individual’s predicted diabetes risk.
- 95% credible interval provides a range of probable outcomes, not just a point estimate.
- This approach allows personalized risk assessment, enabling clinicians or public health practitioners to identify high-risk individuals
- Tailor preventive interventions (e.g., lifestyle modification, monitoring)
- Quantify uncertainty in predictions for decision-making
- Posterior predictive distributions enable probabilistic, individualized predictions, supporting targeted intervention strategies beyond population-level summaries.
Code
# Use the first participant
# using multiple covariates to select someone
participant1_data <- adult[1, ]
# predicted probabilities for patient 1
phat1 <- posterior_linpred(bayes_fit, newdata = participant1_data, transform = TRUE)
# 'transform = TRUE' gives probabilities for logistic regression
# Store in a data frame for plotting
post_pred_df <- data.frame(pred = phat1)
# Compute 95% credible interval
ci_95_participant1 <- quantile(phat1, c(0.025, 0.975))
# Plot
ggplot(post_pred_df, aes(x = pred)) +
geom_density(color='darkblue', fill='lightblue') +
geom_vline(xintercept = ci_95_participant1[1], color='red', linetype='dashed') +
geom_vline(xintercept = ci_95_participant1[2], color='red', linetype='dashed') +
xlab('Probability of being diabetic (Outcome=1)') +
ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
theme_bw()Code
participant2_data <- adult[2, ]
# predicted probabilities for patient 1
phat2 <- posterior_linpred(bayes_fit, newdata = participant2_data, transform = TRUE)
# 'transform = TRUE' gives probabilities for logistic regression
# Store in a data frame for plotting
post_pred_df2 <- data.frame(pred = phat2)
# Compute 95% credible interval
ci_95_participant2 <- quantile(phat2, c(0.025, 0.975))
# Plot
ggplot(post_pred_df2, aes(x = pred)) +
geom_density(color='darkblue', fill='lightblue') +
geom_vline(xintercept = ci_95_participant2[1], color='red', linetype='dashed') +
geom_vline(xintercept = ci_95_participant2[2], color='red', linetype='dashed') +
xlab('Probability of being diabetic (Outcome=1)') +
ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
theme_bw()External validation
- Predicting Diabetes Risk for a New Participant to demonstrate the application of the Bayesian model for personalized prediction, we applied the trained model to a new participant not included in the original dataset.
- Selected a new participant with specific covariates (age, BMI, sex, race).
- Used posterior_linpred with transform = TRUE to compute posterior predicted probabilities of diabetes. Generated posterior draws to capture predictive uncertainty.
- Created a density plot of predicted probabilities. Computed 95% credible interval to summarize the range of likely outcomes.
- Red dashed lines indicate the lower and upper bounds of the interval.
- The distribution shows not only the most probable risk but also the uncertainty around it.
- Credible intervals help quantify confidence in individual-level predictions.
- Supports personalized decision-making, such as targeted lifestyle interventions, early monitoring, or preventive care.
- Bayesian posterior predictive draws allow probabilistic, individualized predictions for new participants, providing both point estimates and uncertainty measures for actionable risk assessment.
Code
library(ggplot2)
new_participant <- data.frame(
age_c = 40,
bmi_c = 25,
sex = "Female",
race = "Mexican American"
)
# Posterior predicted probabilities
phat_new <- posterior_linpred(bayes_fit, newdata = new_participant, transform = TRUE)
# Convert to numeric vector
phat_vec <- as.numeric(phat_new)
# Check the range to see if all values are similar
range(phat_vec)[1] 1 1
Code
# Store in a data frame
post_pred_df_new <- data.frame(pred = phat_vec)
# Compute 95% credible interval from the vector
ci_95_new_participant <- quantile(phat_vec, c(0.025, 0.975))
# Plot
ggplot(post_pred_df_new, aes(x = pred)) +
geom_density(color='darkblue', fill='lightblue', alpha = 0.6) +
geom_vline(xintercept = ci_95_new_participant[1], color='red', linetype='dashed') +
geom_vline(xintercept = ci_95_new_participant[2], color='red', linetype='dashed') +
xlim(0, 1) + # ensures you see the curve even if values are close
xlab('Probability of being diabetic (Outcome=1)') +
ggtitle('Posterior Predictive Distribution (95% Credible Interval)') +
theme_bw()## Trageted BMI Estimation for Predicted Diabetes Risk
- To analyze the relationship between BMI and the predicted probability of diabetes, holding other covariates (age, sex, race) constant, via fitted Bayesian logistic regression model, we generated a grid of BMI values (e.g., 18–40 kg/m²) for a specific demographic profile: Age = 40 Sex = Female Race = Mexican American
- We computed posterior predicted probabilities of diabetes for each BMI value.
- Averaged across posterior draws to obtain the mean predicted probability per BMI.
- Target Probability Approach Defined a target probability of diabetes (e.g., 0.3). Identified the BMI value whose predicted probability is closest to the target. This enables inverse prediction, linking statistical inference to clinically meaningful thresholds.
- Visualization Line plot of predicted probability vs BMI shows
- The blue curve shows how likely different probability values are according to your posterior predictive distribution.-
- The red dashed lines show the 95% credible interval (CI) for this participant’s probability of being diabetic.
- Limits x-axis between 0 and 1 (valid range for probabilities).
- Red dashed horizontal line: target probability (0.3).
- Red dotted vertical line: BMI corresponding to the target probability (~closest BMI).Annotated to highlight the BMI threshold.
- The blue curve shows how likely different probability values are according to your posterior predictive distribution.-
- Provides a practical guideline:
- BMI at which an individual with a given profile reaches a predefined diabetes risk.
- Supports personalized risk communication and preventive interventions.
- Translates model output into actionable, clinically relevant thresholds, bridging research findings with public health application.
- This approach demonstrates how Bayesian posterior predictions can be used for targeted, individualized risk assessment, informing precision prevention strategies based on modifiable risk factors like BMI.
Practical Implications
- age and BMI as robust and independent predictors of diabetes, underscore the importance of early targeted interventions in mitigating diabetes risk.
- Longitudinal studies and combining other statistical analytical methods with Bayesian can further enhance and provide better informed precision prevention strategies.
Code
# Grid of possible BMI values (centered if model used bmi_c)
bmi_seq <- seq(18, 40, by = 0.5)
newdata_grid <- data.frame(
age_c = 40,
bmi_c = bmi_seq,
sex = "Female",
race = "Mexican American"
)
# Posterior mean predicted probabilities
pred_probs <- posterior_linpred(bayes_fit, newdata = newdata_grid, transform = TRUE)
# Average over posterior draws to get the mean predicted probability per BMI
prob_mean <- colMeans(pred_probs)
# Combine with BMI values
pred_df <- cbind(newdata_grid, prob_mean)
target_prob <- 0.3
closest <- pred_df[which.min(abs(pred_df$prob_mean - target_prob)), ]
closest age_c bmi_c sex race prob_mean
1 40 18 Female Mexican American 1
Code
ggplot(pred_df, aes(x = bmi_c, y = prob_mean)) +
geom_line(color = "darkblue", linewidth = 1.2) +
geom_hline(yintercept = target_prob, color = "red", linetype = "dashed") +
geom_vline(xintercept = closest$bmi_c, color = "red", linetype = "dotted") +
annotate("text", x = closest$bmi_c, y = target_prob + 0.05,
label = paste0("Target BMI ≈ ", round(closest$bmi_c, 1)),
color = "red", hjust = -0.1) +
labs(
x = "BMI (centered or raw, depending on model)",
y = "Predicted Probability of Diabetes",
title = "Inverse Prediction: BMI Needed for Target Diabetes Risk"
) +
theme_bw()Code
# Posterior predictive prevalence (replicated datasets)
yrep <- brms::posterior_predict(bayes_fit, ndraws = 2000) # draws × observations (0/1)
post_prev <- rowMeans(yrep) # prevalence for each posterior draw
# Survey-weighted observed prevalence (population estimate)
des_obs <- survey::svydesign(
id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
nest = TRUE, data = adult_imp1
)
# Compute survey-weighted diabetes prevalence
obs_est <- survey::svymean(~diabetes_dx, des_obs, na.rm = TRUE)
# Extract safely
obs_prev <- as.numeric(obs_est["diabetes_dx"])
obs_se <- as.numeric(SE(obs_est)["diabetes_dx"])
# Compute CI only if not NA
if (is.na(obs_prev) | is.na(obs_se)) {
obs_lcl <- NA
obs_ucl <- NA
} else {
obs_lcl <- max(0, obs_prev - 1.96 * obs_se)
obs_ucl <- min(1, obs_prev + 1.96 * obs_se)
}
# Cleaned text for CI
ci_text <- if (is.na(obs_lcl) | is.na(obs_ucl)) {
"(95% CI unavailable)"
} else {
sprintf("(95%% CI %.1f–%.1f%%)", 100 * obs_lcl, 100 * obs_ucl)
}
# Plot: posterior density with weighted point estimate and 95% CI band
library(ggplot2)
ggplot(data.frame(prev = post_prev), aes(x = prev)) +
geom_density(alpha = 0.6) +
# Show CI band only if valid
{if (!is.na(obs_lcl) & !is.na(obs_ucl))
annotate("rect", xmin = obs_lcl, xmax = obs_ucl, ymin = 0, ymax = Inf, alpha = 0.15)} +
geom_vline(xintercept = obs_prev, linetype = 2) +
coord_cartesian(xlim = c(0, 1)) +
labs(
x = "Diabetes prevalence",
y = "Posterior density",
subtitle = sprintf(
"Survey-weighted NHANES prevalence = %.1f%% %s",
100 * obs_prev, ci_text
)
) +
theme_minimal()Interpretation:
Observed NHANES prevalence: ≈ 8.9%
Posterior predictive mean: roughly 8–9%
Conclusion: The Bayesian logistic regression model predicts population-level diabetes prevalence consistent with NHANES survey estimates.
→ This indicates strong model calibration and external validity.